DOCUMENT RESUME

ED 227 137                                              TM 830 131

AUTHOR        Schmitt, Neal
TITLE         Formula Estimation of Cross-Validated Multiple
              Correlation.
PUB DATE      Aug 82
NOTE          13p.; Paper presented at the Annual Meeting of the
              American Psychological Association (90th, Washington,
              DC, August 23-27, 1982).
PUB TYPE      Speeches/Conference Papers (150) -- Reports -
              Research/Technical (143)
EDRS PRICE    MF01/PC01 Plus Postage.
DESCRIPTORS   *Correlation; *Estimation (Mathematics);
              *Mathematical Formulas; Mathematical Models;
              *Multiple Regression Analysis; Predictive
              Measurement; Research Methodology; Weighted Scores
IDENTIFIERS   *Cross Validation; Predictive Models

ABSTRACT

A review of cross-validation shrinkage formulas is
presented which focuses on the theoretical and practical problems in
the use of various formulas. Practical guidelines for use of both
formulas and empirical cross-validation are provided. A comparison of
results using these formulas in a range of situations is then
presented. The results of these comparisons indicate that one should
use Cattin's formula to estimate cross-validated R, employing either
Wherry or Olkin-Pratt estimates of the population R. If examination
of predictor-criterion correlations has occurred prior to regression
analysis, use empirical cross-validation, or adjust p to indicate the
original number of variables examined. Double cross-validation is
considered inefficient and unsatisfactory, and a cautionary remark
concerning the functional number of predictors is presented.

Estimates of Cross-Validated Multiple Correlation

Abstract 

A review of cross-validation shrinkage formulas is presented which focuses on 
the theoretical and practical problems in the use of various formulas. A 
comparison of results using these formulas in a range of situations is then 
presented. The result of these comparisons is that use of Cattin's formula
is recommended. Double cross-validation is considered inefficient and
unsatisfactory, and a cautionary remark concerning the functional number of
predictors is presented.

Formula Estimation of Cross-Validated
Multiple Correlation

In 1931, Wherry published a formula subsequently used to estimate the
multiple correlation between actual measures on some criterion variable and
predicted values of that same variable. The predictions, of course, are made
using regression weights developed in a sample from the same population
about which the predictions are made. Wherry himself recognized that his
formula really was an estimate of what the multiple correlation between
predicted and actual criterion values would be if one had the population or true
regression weights instead of those derived from some fallible sample. Because
of the recognition that the formula was conceptually inappropriate (Wherry, 1951)
and because of bad experiences with applications of the formula (Guion, 1965),
most authors concerned with the stability of their prediction equations used
actual empirical cross-validation of the type described by Mosier (1951).
Briefly, Mosier proposed splitting the sample in half, computing regression
equations and associated multiple correlations in both halves, and then using
the regression equations developed in one half to make predictions about values
of the criterion in the other half. Correlations between actual and predicted
values for these two cross-validations were averaged to provide an estimate of
the cross-validity. Mosier's procedure was called double cross-validation.
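
As a rough illustration of the procedure just described, the following Python sketch splits a sample in half, fits least-squares weights in each half, applies them to the other half, and averages the two resulting correlations. The function names, the random split, and the use of NumPy are assumptions made for illustration only; they are not taken from Mosier (1951).

```python
import numpy as np


def _fit_ols(X, y):
    """Least-squares weights (intercept first) for predictors X and criterion y."""
    design = np.column_stack([np.ones(len(y)), X])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return weights


def _predict(weights, X):
    """Predicted criterion values from weights that include an intercept."""
    return np.column_stack([np.ones(len(X)), X]) @ weights


def double_cross_validation(X, y, seed=0):
    """Average of the two half-sample cross-validity correlations.

    The random split controlled by `seed` is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    half_a, half_b = order[: len(y) // 2], order[len(y) // 2:]
    r_values = []
    # Fit in one half, predict in the other, then swap the roles.
    for fit_idx, test_idx in ((half_a, half_b), (half_b, half_a)):
        weights = _fit_ols(X[fit_idx], y[fit_idx])
        predicted = _predict(weights, X[test_idx])
        r_values.append(np.corrcoef(y[test_idx], predicted)[0, 1])
    return float(np.mean(r_values))
```

Note that each set of weights is estimated on only half of the available cases, which is the source of the inefficiency discussed later in the paper.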

In 1977, Schmitt, Coyle, and Rauschenberger evaluated the performance of
the Wherry estimate, two other similar formulas (Darlington, 1968;
Nicholson, 1960), and the double cross-validation technique. These four
methods were evaluated in a Monte Carlo study using (1) the difference
between the estimated cross-validated R and the actual population cross-validity
and (2) the standard deviation of these estimates. Several guidelines for the
usage of these formulas were presented, most significantly that actual empirical
cross-validation was inefficient and likely to be in greater error in any single
application than any of the formulas, including the Wherry formula.

Since that time several papers have appeared which have taken issue
with the appropriateness of the formulas evaluated by Schmitt et al. (1977).
Rozeboom (1978) indicated their conceptual inadequacy, and Rozeboom (1978) as
well as Drasgow, Dorans, and Tucker (1979) have shown that for low levels of
multiple correlation not sampled by Schmitt et al. (1977), the formulas,
particularly Darlington's, produced a severe negative bias. That is, for low
levels of sample multiple correlation, the formula estimates of cross-validated
multiple correlation were much too low.

Since that time there has also been general agreement (Cattin, 1980a,
1980b; Rozeboom, 1978, 1981) that a fourth formula presented by Browne (1975)
is mathematically correct. Table 1 is a presentation of various formula estimates.

Insert Table 1 about here

As can be seen, the Browne formula is horrendous from a computational viewpoint.
Hence recent efforts (Cattin, 1980a; Rozeboom, 1981) have focused on the
development and evaluation of shortcut formulas which yield essentially the same
values as the formula presented by Browne (1975).

The purpose of this paper is, first, to present briefly some comparisons of these
formulas, parts of which are available in the citations listed above. Second,
I will attempt to provide practical guidelines for use of both formulas and
empirical cross-validation.

Method

A range of possible sample squared multiple correlations (.1 to .9),
sample sizes (40 to 240), and numbers of predictors (5 to 25) was selected to be
reasonably representative of applied research employing multiple regression.
The various formulas presented in Table 1 were then applied to these sample
statistics to provide estimates of the population cross-validity in any given
situation.
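
The computation described in this section can be sketched as follows for three of the formulas. This is a minimal sketch under stated assumptions: the Wherry, Nicholson, and Darlington expressions are written in the forms commonly given in the shrinkage literature, the grid levels are those that appear in Table 2, and all function names are hypothetical. For R^2 = .6, N = 40, p = 5 the three functions return .54, .47, and .46, which agrees with the first three legible estimates in that row of Table 2.

```python
from itertools import product


def wherry(r2, n, p):
    """Wherry estimate of the squared population multiple correlation."""
    return 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)


def nicholson(r2, n, p):
    """Nicholson (fixed-model) estimate of the squared cross-validity."""
    return 1.0 - (n - 1) / (n - p - 1) * (n + p + 1) / n * (1.0 - r2)


def darlington(r2, n, p):
    """Darlington (random-model) estimate of the squared cross-validity."""
    return (1.0 - (n - 1) / (n - p - 1) * (n - 2) / (n - p - 2)
            * (n + 1) / n * (1.0 - r2))


def estimate_grid(r2_levels=(0.1, 0.2, 0.4, 0.6, 0.8, 0.9),
                  sizes=(40, 80, 240),
                  predictors=(5, 10, 25)):
    """Apply each formula to every combination of R^2, p, and N."""
    rows = []
    for r2, p, n in product(r2_levels, predictors, sizes):
        rows.append((r2, n, p, wherry(r2, n, p),
                     nicholson(r2, n, p), darlington(r2, n, p)))
    return rows


if __name__ == "__main__":
    for r2, n, p, w, nich, darl in estimate_grid():
        print(f"{r2:.1f} {n:4d} {p:3d} {w:6.2f} {nich:6.2f} {darl:6.2f}")
```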

Results and Discussion

Table 2 presents cross-validity estimates based on the various proposed
formulas (see Table 1). Various levels of sample squared multiple correlation
(R^2), sample size (N), and number of predictors (p) are used in these
computations.

Insert Table 2 about here

If one examines Table 2, it becomes obvious that for relatively small N/p
ratios there are larger differences among the formulas, as various authors
have pointed out (Rozeboom, 1978; Drasgow, Dorans, & Tucker, 1979), and they
are likely practically important differences.

The other factor that is extremely important practically is that the
Nicholson and Darlington formulas fail for low levels of multiple correlation
(R^2 < .6), which are precisely the levels of multiple correlation typically found
in applied situations. This failure, of course, is the one noted by the various
authors cited above. Finally, even the Cattin and Rozeboom alternatives
produce impossible results when the N/p ratio and R^2 are small. The underlined
values in Table 2 are illustrative of this problem.

The most significant conclusion to be drawn from these results, as well
as from the other cited literature on this topic, is that Cattin's formula is the
most appropriate estimate of the cross-validated multiple correlation. Note
that Cattin's formula requires the use of Wherry's formula to calculate the
population multiple correlation. Further, it seems appropriate that use of
even this formula be restricted to instances in which N/p is greater than 2,
especially when R^2 is low (< .6).
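
The two-step calculation described above (a Wherry estimate of the population value, then Cattin's expression) can be sketched in Python as follows. The function names are hypothetical, and the `usable` check is only an illustrative reading of the N/p > 2 guideline, with an added assumption that a negative Wherry estimate should also be treated as a warning sign.

```python
def wherry_rho2(r2, n, p):
    """Wherry estimate of the squared population multiple correlation."""
    return 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)


def cattin_rho_c2(r2, n, p):
    """Cattin estimate of the squared cross-validity, using Wherry's rho^2."""
    rho2 = wherry_rho2(r2, n, p)
    numerator = (n - p - 3) * rho2 ** 2 + rho2
    denominator = (n - 2 * p - 2) * rho2 + p
    return numerator / denominator


def usable(r2, n, p):
    """Illustrative guard: N/p > 2 and a non-negative Wherry estimate.

    The second condition is an assumption added for this sketch.
    """
    return n / p > 2 and wherry_rho2(r2, n, p) >= 0.0


if __name__ == "__main__":
    for r2, n, p in ((0.6, 40, 5), (0.1, 40, 25)):
        print(r2, n, p, round(wherry_rho2(r2, n, p), 3),
              round(cattin_rho_c2(r2, n, p), 3), usable(r2, n, p))
```

With R^2 = .1, N = 40, and p = 25 the Wherry estimate is negative, yet the expression still returns a clearly positive value, which is the kind of impossible result noted above for small N/p and low R^2.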

At least one other practical issue remains. Is it better to use the Cattin
formula or empirical cross-validation? My answer is that empirical
cross-validation is not only a waste of time, it is less satisfactory than any
formula estimate. The reason for this was displayed in Table 3 of Schmitt,
Coyle, and Rauschenberger (1977). Empirical cross-validation, since it is
based on substantially less than the total sample, is associated with greater
variance across replications than are formula estimates. So, in any given
instance, we can be much more wrong in our estimate with empirical
cross-validation than if we had applied one of the formulas available.

A final note of caution in the use of formula estimates of cross-validation
is that they assume there has been no "data-snooping" prior to the calculation
of the sample regression equation. The procedure in some studies is to compute
zero-order correlations between a criterion and a large number of predictors,
pick those variables which are significantly related to the criterion, and compute
a regression equation and multiple correlation using this subset of significant
predictors. The functional p in this case is not the number of predictors in
the regression equation but the total number of potential predictors for which
correlations were observed.
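
To show how much the choice of p can matter, the sketch below evaluates Cattin's expression (with a Wherry estimate of the population value) twice: once with the number of predictors retained in the equation and once with the number originally screened. The scenario (R^2 = .4, N = 80, 5 predictors retained out of 25 screened) and the function names are hypothetical and chosen only for illustration.

```python
def wherry_rho2(r2, n, p):
    """Wherry estimate of the squared population multiple correlation."""
    return 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)


def cattin_rho_c2(r2, n, p):
    """Cattin estimate of the squared cross-validity, using Wherry's rho^2."""
    rho2 = wherry_rho2(r2, n, p)
    return ((n - p - 3) * rho2 ** 2 + rho2) / ((n - 2 * p - 2) * rho2 + p)


def compare_functional_p(r2, n, p_retained, p_screened):
    """Estimates using the retained count versus the screened (functional) count."""
    return {
        "retained": cattin_rho_c2(r2, n, p_retained),
        "screened": cattin_rho_c2(r2, n, p_screened),
    }


if __name__ == "__main__":
    result = compare_functional_p(r2=0.4, n=80, p_retained=5, p_screened=25)
    print({name: round(value, 3) for name, value in result.items()})
```

In this hypothetical case the estimate drops from roughly .33 to roughly .03 when the functional p replaces the number of retained predictors.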

Conclusions

The conclusions are simple: (1) use the Cattin formula to estimate
cross-validated R, employing either Wherry or Olkin-Pratt estimates of the
population R (see Cattin, 1980a, for details); and (2) if examination of
predictor-criterion correlations has occurred prior to regression analysis, use
empirical cross-validation or adjust p to indicate the original number of
variables examined.

Table 1



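Summary of Cross-Validation Formulas

Each entry gives the author, the formula, and the accompanying comment.

Wherry (1931)
    $\hat{\rho}^2 = 1 - \frac{N-1}{N-p-1}\,(1 - R^2)$
    Estimates multiple R when we have population weights.

Nicholson (1960)
    $\hat{\rho}_c^2 = 1 - \frac{N-1}{N-p-1}\cdot\frac{N+p+1}{N}\,(1 - R^2)$

Darlington (1968)
    $\hat{\rho}_c^2 = 1 - \frac{N-1}{N-p-1}\cdot\frac{N-2}{N-p-2}\cdot\frac{N+1}{N}\,(1 - R^2)$
    Nicholson and Darlington formulas: developed for fixed and random
    effects models, respectively; both suffer negative bias > .1 when
    N/p < 2.

Rozeboom (1978)
    [formula illegible in the source scan]

Cattin (1980a, 1980b)
    $\hat{\rho}_c^2 = \frac{(N-p-3)\rho^4 + \rho^2}{(N-2p-2)\rho^2 + p}$
    First portion of the Browne (1975) formula. $\rho$ must be estimated
    separately by a formula such as Wherry's above or, in cases of very
    low N, a formula provided by Olkin and Pratt (1958).

Browne (1975)
    The expression given for Cattin, plus a correction term. The legible
    parts of that term are the factors $2(N-p-2)(N-2p-6)(p-1)$ in the
    numerator and $(N-p-4)$ and $[(N-2p-2)\rho^2 + p]$ in the denominator;
    its remaining factors and exponents, and the order of the remainder,
    are illegible in the source scan.

Note. In all formulas, R = sample multiple correlation, N = sample size,
p = number of predictor variables, $\rho$ = population multiple correlation,
$\rho_c$ = population cross-validity.

Table 2
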
Estimates of $\rho_c$ Based on Wherry, Nicholson, Rozeboom,
Darlington, and Cattin Formulas for
Various Combinations of R^2, N, and p

[The column headings for the five estimates and the underlining of
individual values did not survive in the source scan. The estimates are
given in the order in which they appear, and [?] marks a cell that is
blank or illegible there.]

R^2   N     p     (1)    (2)    (3)    (4)    (5)

.9    40    5     [?]    [?]    .86    .87    .88
.9    80    5     [?]    [?]    .89    .89    .89
.9    240   5     [?]    [?]    .90    .90    .90
.9    40    10    [?]    [?]    .81    .82    .83
.9    80    10    [?]    [?]    .87    .87    .87
.9    240   10    [?]    [?]    [?]    .89    .89
.9    40    25    [?]    [?]    .17    .41    .44
.9    80    25    [?]    [?]    .78    .79    .79
.9    240   25    [?]    [?]    .88    .88    .88

.8    40    5     .77    [?]    [?]    .74    .75
.8    80    5     [?]    .77    .77    .77    .78
.8    240   5     [?]    [?]    [?]    .79    .79
.8    40    10    [?]    [?]    [?]    [?]    .67
.8    80    10    .77    [?]    [?]    [?]    .74
.8    240   10    [?]    [?]    [?]    .78    .78
.8    40    25    [?]    [?]    [?]    .13    .15
.8    80    25    .71    [?]    [?]    .59    .60
.8    240   25    .78    .75    [?]    [?]    .75

.6    40    5     .54    .47    .46    .48    .51
.6    80    5     .57    .54    .54    .55    .55
.6    240   5     .59    .58    .58    .58    .59
.6    40    10    .46    .31    [?]    .33    .36
.6    80    10    .54    .48    .47    .48    .49
.6    240   10    .58    .56    .56    .57    .57
.6    40    25    -.11   -.84   a      .01    .00
.6    80    25    .42    .23    .13    .25    .26
.6    240   25    .55    .51    .50    .51    .51

.4    40    5     .31    .21    .19    .23    .26
.4    80    5     [?]    .31    .31    .32    .33
.4    240   5     [?]    .37    .37    .37    .38
.4    40    10    [?]    -.03   -.12   .08    .10
.4    80    10    [?]    .22    .20    .24    .24
.4    240   10    [?]    [?]    .34    .35    .35
.4    40    25    [?]    [?]    [?]    .18    .17
.4    80    25    [?]    [?]    -.31   .03    .03
.4    240   25    [?]    [?]    .25    .27    .27

.2    40    5     [?]    [?]    -.08   .03    .04
.2    80    5     [?]    [?]    .08    .10    .11
.2    240   5     [?]    .16    .16    .17    .17
.2    40    10    [?]    [?]    -.50   .02    .01
.2    80    10    [?]    [?]    -.06   .03    .04
.2    240   10    [?]    [?]    .13    .14    .14
.2    40    25    [?]    [?]    [?]    .49    [?]
.2    80    25    -.17   [?]    -.74   .08    .07
.2    240   25    .11    [?]    [?]    [?]    .06

.1    40    5     -.03   -.19   -.22   .01    .00
.1    80    5     .04    -.03   -.04   .02    .02
.1    240   5     .08    .06    .06    .07    .07
.1    40    10    -.21   -.54   -.68   .20    .19
.1    80    10    -.03   -.17   -.20   .01    .04
.1    240   10    .02    -.07   -.08   .00    .04
.1    40    25    [?]    [?]    [?]    .69    .73
.1    80    25    -.32   -.75   -.96   .33    .30
.1    240   25    -.01   -.11   -.13   .00    .00

a Values in cases with a blank were less than -1.00.